Building a Parallel Multilingual Corpus (Arabic-Spanish-English)

Authors

  • Doaa Samy
  • Antonio Moreno-Sandoval
  • José María Guirao
  • Enrique Alfonseca
Abstract

This paper presents the first-phase results of ongoing research at the Computational Linguistics Laboratory of the Autónoma University of Madrid (LLI-UAM), which aims to develop a multilingual parallel corpus (Arabic-Spanish-English) aligned at the sentence level and tagged at the POS level. A multilingual parallel corpus that brings together Arabic, Spanish and English is a new resource for the NLP community and completes the present panorama of parallel corpora. In the first part of this study, we introduce the novelty of our approach and the challenges encountered in creating such a corpus. This introductory part highlights the main features of the corpus and the criteria applied during the selection process. The second part focuses on two main stages: basic processing (tokenization and segmentation) and alignment. The alignment methodology is explained in detail, and the results obtained for the three language pairs are compared. POS tagging and the tools used at this stage are discussed in the third part. The final output is available in two versions: a non-aligned version and an aligned one. The latter adopts the TMX (Translation Memory Exchange) standard format. Finally, the section on future work points out the key stages in extending the corpus and the studies that can benefit, directly or indirectly, from such a resource.

1. The LLI-UAM Multilingual Parallel Corpus: A New Resource *

1.1. State-of-the-art

Much work has been carried out on developing parallel corpora, either bilingual or multilingual. However, in our opinion, there are two main reasons behind the uniqueness and novelty of our corpus. Both are directly related to the state of the art in the field.

First, there is a significant gap between the number of resources available for English and Spanish, on the one hand, and the resources available for Arabic, on the other.
This imbalance is reflected in the studies on parallel corpora, especially those dealing with Arabic. In most cases, they are bilingual studies that pair Arabic with English. The results of the survey we conducted to locate Arabic parallel corpora confirm this. Four corpora are available through the LDC:

1. UN Arabic English Parallel Text (LDC2004E13)
2. Arabic News Translation Text Part 1 (LDC2004T17)
3. Multiple Translation Arabic (MTA) Part 1 (LDC2003T18)
4. Arabic English Parallel News Part 1 (LDC2004T18)

Second, major initiatives aimed at developing multilingual corpora have been undertaken within the framework of various European projects such as CRATER (Garside et al. 1994), MULTEXT (Ide & Veronis 1994), and ECI/MCI. More recent are the initiatives of OPUS (Tiedemann & Nygaard 2004) and EUROPARL (Koehn 2005). Consequently, their coverage is limited to European languages, and Arabic is not included.

Taking both factors into consideration, we maintain that the corpus presented here is the first parallel corpus offering this language combination (Arabic-Spanish-English).

* Part of this research has been supported by the grant TIN200407588-C03-02 (Spanish Ministry of Education and Science).

1.2. Building the corpus

The selection process involved a number of challenges in meeting the established criteria for quality and quantity. Finding a considerable quantity of quality texts available in the three languages was our main endeavor. Quality in this case is directly related to the nature, source and translation of the selected texts. Representativeness, availability in electronic format and legal use are other relevant issues at this stage. To apply these criteria, the following decisions were taken:

1. Texts should not be automatically translated.
2. Texts should represent the modern standard use of the language.
3. Sources should be freely available in electronic format.
4. Authors' copyrights should be respected, and the use of the texts should fall within the principle of Fair Use.

Opting for United Nations documents was the most practical and feasible solution. The reasons can be summarized as follows:

1. Arabic, Spanish and English are among the official languages of the Organization.
2. Translation quality is guaranteed.
3. Texts represent a modern standard use of the language.
4. Documents are freely available in considerable quantities.
5. The UN explicitly states that using texts for academic purposes is considered "fair use".
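
To make the aligned output mentioned in the abstract more concrete, the following is a minimal sketch of how one sentence-aligned unit in the three languages could be serialized as TMX. It is not the authors' tool: the function name build_tmx, the header values, and the three sample sentences are assumptions chosen for illustration, and the sentences are invented placeholders rather than text taken from the corpus. Only the element and attribute names (tmx, header, body, tu, tuv, seg, xml:lang) come from the TMX 1.4 format itself.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(aligned_units, srclang="en"):
    """Build a TMX tree from a list of {language code: sentence} dicts.

    Each dict is one aligned unit and becomes one <tu> element holding
    one <tuv> per language. Header values are illustrative assumptions.
    """
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",              # units are aligned sentences
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": srclang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for unit in aligned_units:
        tu = ET.SubElement(body, "tu")
        for lang, sentence in unit.items():
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            seg = ET.SubElement(tuv, "seg")
            seg.text = sentence
    return tmx


# Invented placeholder sentences, not taken from the corpus.
sample_units = [
    {
        "en": "The General Assembly adopted the resolution.",
        "es": "La Asamblea General aprobó la resolución.",
        "ar": "اعتمدت الجمعية العامة القرار.",
    }
]

if __name__ == "__main__":
    root = build_tmx(sample_units)
    print(ET.tostring(root, encoding="unicode"))
```

Keeping each aligned unit as a mapping from language code to sentence means that a unit missing one language (for example, a sentence with no Spanish counterpart) would simply produce a tu element with fewer tuv children, which is one plausible way to represent partial alignments.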





Publication date: 2006